State testing programs regularly release examples of test items to the public. These releases
serve multiple purposes. They provide educators and students an opportunity to familiarize 
themselves with item formats. They demystify the testing experience for the public. And they 
can improve understanding of test scores by illustrating the kinds of tasks that students at 
particular achievement levels can accomplish successfully. As exemplars, these items are 
typically screened carefully, with demonstrated alignment to state content standards. They are 
generally evaluated at great expense in operational administrations and field tests. They have 
known quality and technical characteristics. However, states generally release the items 
themselves, not their technical characteristics. This prevents any use of released items to 
estimate scores on state scales. 


This is generally wise. Released items have unknown exposure and unknown familiarity, and
uncontrolled conditions in any re-administration would threaten standard inferences about
proficiency. State testing programs are rightfully hesitant to sanction any uses of released items,
to protect against coaching that would inflate scores on a typical administration. However, at
this writing in August of 2020, there are serious threats to any notion of a typical
administration, and there is a dearth of high-quality assessment options. In this current
pandemic, we argue that states should make technical parameters of released items public to
support low-stakes uses of standards-based test score reports. The cost is negligible, and all
assessment options should be available to educators for educational monitoring purposes. In 
this article, we provide a recipe for construction of tests using released items and provide 
guardrails to ensure appropriate use in an educational crisis. 


Assessment in the COVID-19 Crisis 


In the spring of 2020, COVID-19 caused U.S. school districts to cease in-person instruction 
months earlier than usual. The first states closed schools on March 16, and all states had 
recommended school closure by March 24 (Education Week, 2020). Remote instruction has 
differed substantially between and within states in implementation and uptake (Harris et al.,
2020). As schools open in person and online in the fall of 2020, unusual numbers of students
may not have learned nor had the opportunity to learn previous grade material. 


Although projections exist for the magnitude of declines and possible increases in disparities 
(Kuhfeld et al., 2020), assessments can provide a more direct estimate this school year. Results 
of such interim assessments can inform strategies to support teachers and students, including
funding, curriculum redesign, and instruction (Perie, Marion, & Gong, 2009).


COVID-19 is an international health disaster, and standardized measures of proficiency in 
reading, writing, mathematics, and other subjects should be tertiary to other assessment 
targets and assessment purposes (Lake & Olson, 2020; Marion, Gong, Lorié, & Kockler, 2020; 
Olson, 2020). There is a hierarchy of assessment needs in a crisis, and measures of academic
levels should rightfully be tertiary. Higher priorities and assessment approaches should include: 


• Teacher- or parent-reported surveys of students’ spring attendance, participation, and
content coverage. In many schools with remote instruction, teachers and parents can
report their impressions of attendance, participation, and proficiency compared to prior
years.


• Existing classroom and district assessments. Districts already have access to classroom
assessments that can assess prior-grade material. Some district-level assessments have
fall tests that can report scores linked to state proficiency standards.


• Assessments of physical, mental, and social-emotional health, sufficient levels of which
are necessary conditions for learning.


As an optional supplement to these approaches, school and district educational personnel may 
also find aggregate summaries of student proficiency in terms of state performance standards 
useful. For example, a school or district may recognize from the other assessments listed above
that substantial units or students had no access to material taught at the end of the year,
motivating some weeks of review of prior-grade content. A test composed of previously
released, prior-grade items would enable estimation of proficiency distributions on prior-grade
score scales, including proficiency in terms of achievement level cut scores.


Although some districts have access to assessments that report on state test score scales, 
usually through statistical projections, such assessments are costly and not universal. Tests
composed of released items are free and interpretable directly in terms of state achievement
levels. We also show how item maps built from released items can provide educators with
examples of performance tasks that students in each achievement level can do. We provide an
explicit recipe for such tests; then we conclude with clear guardrails for appropriate use. In
particular, we caution that any current use (or implied future use) of these scores for judgments
about student tracking, educator effectiveness, or school effectiveness would invite severe bias
and inflation that would render scores unusable for those high-stakes purposes. 


Availability of Released Items and Parameter Estimates 


Interest in the reuse of calibrated items surged in the 1990s as the National Assessment of
Educational Progress (NAEP) began reporting state results. The term “market-basket reporting” 
(National Research Council, 2000) was considered and discarded, and authors demonstrated 
how “domain scores” using Item Response Theory could support reuse of calibrated items 
(Bock, Thissen, & Zimowski, 1997; Pommerich, 2006). More recently, there has been 
international interest in creating tests for administration across different countries and 
conditions (Das & Zajonc, 2020; Muralidharan, Singh, & Ganimian, 2019). We could not find a 
straightforward recipe for creating such tests nor an article that discussed application and 
caveats in a crisis. 


Unfortunately, in our search of publicly available manuals, we found few examples of state 
technical manuals that enable users to merge published items with published estimates. This does
not appear to be an intentional omission. Rather, state testing program personnel may reason
that released items have an audience that is not interested in technical specifications, and item 
parameter estimates have an audience that is not interested in item content. We hope that it 
becomes standard practice to either publish item parameter estimates with released items or 
include a key that enables merging of released items with parameter estimates in technical 
manuals. 


Table 1 shows whether the key ingredients for reuse of items are available across large testing 
programs and states. The ingredients are available for large national and international 
programs like NAEP, PISA, and TIMSS. We also conducted a search of state websites for the 15 
largest states, for items, parameter estimates, and a key linking the two. We find that these 
state testing programs always make operational items available, in the case of some states, 
through the assessment consortia known as Smarter Balanced and New Meridian (which was 
related to the Partnership for Assessment of Readiness for College and Careers, PARCC). We 
found item parameter estimates in a few states. A key that enables a merge of the two key
ingredients was only available for the New York Regents (a longstanding high school testing
program) and in Ohio, where the necessary information was largely available but seemed
unintentional and based on item order rather than item IDs.


Table 1. Online public availability of items and parameter estimates for the construction of 
open tests 


Testing Program       1) Are operational   2) Are item           3) Is a key enabling
                      (or field tested)    parameter estimates   a merge of 1) and 2)
                      items available?     available?            available?

NAEP                  Yes                  Yes                   Yes
PISA                  Yes                  Yes                   Yes
TIMSS                 Yes                  Yes                   Yes
Smarter Balanced      Yes                  No                    No
New Meridian (PARCC)  Yes                  No                    No
California            Yes                  No                    No
Texas                 Yes                  No                    No
Florida               Yes                  No                    No
New York              Yes 3-8 & Regents    Yes 3-8 & Regents     No 3-8; Yes Regents
Pennsylvania          Yes                  Yes                   No
Illinois              Yes                  No                    No
Ohio                  Yes                  Yes                   Haphazardly
Georgia               Yes                  No                    No
North Carolina        Yes                  No                    No
Michigan              Yes                  No                    No
New Jersey            Yes                  No                    No
Virginia              Yes                  No                    No
Washington            Yes                  No                    No
Arizona               Yes                  Yes                   No
Massachusetts         Yes                  Yes                   No


Table 1. Online public availability of items and parameter estimates for the construction of 
open tests, for selected large-scale, national and international testing programs and programs 
from the 15 largest states as of August, 2020. This table will be updated online at 
https://emmaklugman.github.io/files/open-tests.html. 


Ingredients for Test Construction Using Released State Test Items 


For this example, we consider a possible use of Grade 4 items to estimate Grade 4 proficiency 
for Grade 5 students in a COVID-19-disrupted year. This illustrative example is available in our
Online Appendix, complete with code in R. We use the National Assessment of Educational
Progress (NAEP) for publicly available ingredients. In practice, ingredients from state tests will 
be preferable given the relative curricular and political relevance of state standards and state 
score scales. The recipe for standards-linked test scores requires five essential ingredients: 


1. Test items
2. Item parameter estimates
3. A list or key enabling association of items and their corresponding estimates
4. Linking functions from underlying θ scales to scale scores
5. Achievement level cut scores

Starting with the first ingredient, designers should ensure selection of items that suits their
desired content coverage. Although the restrictive assumptions of Item Response Theory 
suggest that the selection of items has no effect on score estimation (Yen & Fitzpatrick, 2006), it 
is reasonable to select items in similar proportion to test blueprints, or some subset of items 
from a content area in which educators have particular interest. As we note in our section 
about caveats, state tests are typically administered at the end of a sequence of related 
instruction. If tests are not given in a similar sequence and conditions, standard inferences may 
not apply. Thus, a presentation or review of Grade 4 material that mimics the standard 
instructional onramp to Grade 4 testing would help to ensure appropriate inferences from 
scores. 


The second ingredient is item parameter estimates. These are an occasional feature of technical 
manuals for state tests. Turning to the third ingredient, as we mention above, a link is rarely
available, with the exception of large-scale programs like NAEP, TIMSS, and PISA, and one-off
examples like the New York Regents Exams and Ohio. 
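
To make the role of the third ingredient concrete, the following is a minimal Python sketch of the merge that a key enables. It is not the code from our Online Appendix, which is in R. The item IDs, prompts, field names, and parameter values are all hypothetical and chosen only for illustration.

```python
# Hypothetical records: the IDs, prompts, and parameter values below are
# invented for illustration and do not come from any real testing program.
released_items = [
    {"item_id": "M4-001", "prompt": "Identify the place value of a digit"},
    {"item_id": "M4-002", "prompt": "Determine the perimeter of a rectangle"},
    {"item_id": "M4-003", "prompt": "Represent fractions using a model"},
]
parameter_estimates = {
    "M4-001": {"a": 1.10, "b": -1.05, "c": 0.20},
    "M4-002": {"a": 0.85, "b": 0.40, "c": 0.25},
}


def merge_items_with_estimates(items, estimates):
    """Attach parameter estimates to items via a shared key; report misses."""
    merged, unmatched = [], []
    for item in items:
        params = estimates.get(item["item_id"])
        if params is None:
            # Without a published key entry, the item cannot be scored on the scale.
            unmatched.append(item["item_id"])
        else:
            merged.append({**item, **params})
    return merged, unmatched


merged, unmatched = merge_items_with_estimates(released_items, parameter_estimates)
```

Only the merged items can contribute to a standards-linked score. The unmatched list shows what is lost when a key is missing or incomplete.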


The fourth ingredient is a linking function, usually a simple linear equation for each score scale 
that maps from item parameter estimates on the underlying θ scale to the scale scores for
reporting. Fifth and finally, achievement level reporting, in categories like Basic, Proficient, and 
Advanced, requires cut scores delineating these levels. Both linking functions and achievement 
level cut scores are reported regularly in state technical manuals and documentation. 


Recipe for Test Construction Using Released State Test Items 


The recipe for generating standards-based score reports from the ingredients above requires 
straightforward application of Item Response Theory. The recipe is available online at 
https://emmaklugman.github.io/files/open-tests.html and assumes expertise at the level of a 
first-year survey course in educational measurement. Reviews of IRT include those by Yen and 
Fitzpatrick (2006) and Thissen and Wainer (2001). Many state technical manuals also review 
state-specific scoring procedures and technical details. 


We use a common and straightforward procedure known as the Test Characteristic Curve (TCC)
scoring method, which results in a 1-to-1 table of summed scores to θ estimates and scale scores.
Kolen and Tong (2010) compare this approach with other alternatives. They note that the TCC
approach is transparent and avoids the dependence of scores on priors, which may offset
the slight loss of precision. Users may substitute alternative scoring
approaches into this recipe. 


Given the ingredients listed in the previous section, the recipe follows: 


1. Arrange released test items into an online or paper booklet.
2. Generate a table mapping summed scores to scale scores.
3. Administer the test and collect responses.
4. Sum correct responses to summed scores and locate corresponding scale scores.
5. Report scale scores, including achievement levels and item map locations as desired.

Test items should be arranged to support a natural flow of content and difficulty. For items
where item locations are known, test constructors may try to preserve relative item order. For 
more on principles of test design, see Downing and Haladyna (2006). 


To create a table mapping summed scores to scale scores, we reproduce a standard recipe: we
sum the item characteristic curve functions to obtain a test characteristic curve, invert it, and then
transform the result linearly to estimate scale scores. For simplicity, consider a dichotomously 
scored 3-parameter-logistic model: 

\[ P_i(\theta) = P(X_i = 1 \mid \theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-a_i(\theta - b_i)\right]} \]

Here, each examinee’s dichotomous response X to item i depends upon examinee proficiency
θ and item parameters a, b, and c, indicating information (discrimination), location (difficulty),
and a lower asymptote (pseudo-guessing), respectively. Many models include an arbitrary 
scaling parameter, D = 1.7, which should simply be included or excluded for consistency. The 
sum of these item characteristic curves yields the test characteristic curve: 


\[ T(\theta) = \sum_i P_i(\theta). \]


This sum of probabilities is the expected sum score given known examinee proficiency θ.
Inverting the test characteristic curve using numerical interpolation methods yields the TCC
estimate of θ for any summed score.


\[ \hat{\theta}_{TCC} = T^{-1}\left(\sum_i X_i\right) \]
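
The inversion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R code from our Online Appendix: the five items and their parameters are hypothetical, the scaling constant D defaults to 1 (that is, excluded), and the grid from -6 to 6 with linear interpolation is one simple choice among many numerical methods.

```python
import math


def icc_3pl(theta, a, b, c, D=1.0):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items, D=1.0):
    """Test characteristic curve: expected summed score at theta."""
    return sum(icc_3pl(theta, a, b, c, D) for a, b, c in items)


def invert_tcc(summed_score, items, D=1.0, lo=-6.0, hi=6.0, steps=2400):
    """Invert the TCC by linear interpolation between grid points.

    Returns None when the summed score falls outside the range of the TCC
    on the grid (for example, a perfect score, or a score at or below the
    sum of the lower asymptotes), since no finite estimate exists there.
    """
    width = (hi - lo) / steps
    prev_theta, prev_t = lo, tcc(lo, items, D)
    if summed_score < prev_t:
        return None
    for k in range(1, steps + 1):
        theta = lo + k * width
        t = tcc(theta, items, D)
        if summed_score <= t:
            frac = (summed_score - prev_t) / (t - prev_t)
            return prev_theta + frac * (theta - prev_theta)
        prev_theta, prev_t = theta, t
    return None


# Hypothetical (a, b, c) triples for a five-item test.
items = [(1.0, -1.0, 0.2), (1.2, -0.5, 0.2), (0.8, 0.0, 0.2),
         (1.5, 0.5, 0.2), (0.9, 1.0, 0.2)]
table = {x: invert_tcc(x, items) for x in range(len(items) + 1)}
```

Summed scores without a finite estimate need a convention in practice, such as assigning the lowest or highest obtainable scale score. That choice is left open in this sketch.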
Transformations to scale scores s are typically linear, and constants for the slope and intercept
(M and K, respectively) are often available in technical manuals: 


\[ \hat{s} = M\hat{\theta} + K. \]

States also publish achievement level cut scores denoting minimum threshold scores for 
categories. For NAEP, these achievement level labels are Basic, Proficient, and Advanced and 
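delineated by cut scores in each subject and grade: $c_B$, $c_P$, and $c_A$. A scale score s is assigned an
achievement level category L in straightforward fashion:


\[
L(s) =
\begin{cases}
\text{``Advanced''} & \text{if } s \ge c_A \\
\text{``Proficient''} & \text{if } c_P \le s < c_A \\
\text{``Basic''} & \text{if } c_B \le s < c_P \\
\text{``Below Basic''} & \text{if } s < c_B
\end{cases}
\]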
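
The linear transformation and the level assignment can be sketched together in Python. The constants here are placeholders, not published values: M and K are rough values eyeballed from Table 2, and the three cut scores are chosen only to agree with the level boundaries visible there. Users should take actual constants from program documentation.

```python
def to_scale_score(theta, M, K):
    """Linear transformation from the theta scale to the reporting scale."""
    return M * theta + K


def achievement_level(s, c_B, c_P, c_A):
    """Assign a level label; each cut score is the minimum score for its level."""
    if s >= c_A:
        return "Advanced"
    if s >= c_P:
        return "Proficient"
    if s >= c_B:
        return "Basic"
    return "Below Basic"


# Placeholder constants (assumptions for illustration, not published values).
M, K = 32.0, 241.0
CUTS = {"c_B": 214, "c_P": 249, "c_A": 282}

for theta in (-0.88, -0.68, 0.33, 1.40):
    s = to_scale_score(theta, M, K)
    print(theta, round(s), achievement_level(s, **CUTS))
```

Checking cut scores from highest to lowest keeps the comparisons simple and ensures that a score exactly at a cut falls in the higher category.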


Finally, item maps can illustrate items and tasks that examinees at each score are likely to be 
able to answer correctly. Each item is anchored to the θ scale assuming a given probability of a
correct response, known as the response probability, $p_R$. This can be set to various levels like
0.67 (Huynh, 2006) or, in our example here and online, 0.73. The item response function is then
inverted and transformed to the score scale to obtain each item’s mapped location, $s_i$. Under
the assumptions of IRT, any item from the domain can be mapped, even if it was not
administered to students. 
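
For readers who want the inversion spelled out, the following is a sketch of the algebra. It assumes the three-parameter logistic form without the scaling constant D and a linear transformation with slope M and intercept K. It writes the response probability as $p_R$ and the anchored proficiency for item i as $\theta_i$, both notational assumptions, and it requires $c_i < p_R < 1$.

```latex
\begin{align*}
p_R &= c_i + \frac{1 - c_i}{1 + \exp\left[-a_i(\theta_i - b_i)\right]} \\
\exp\left[-a_i(\theta_i - b_i)\right] &= \frac{1 - p_R}{p_R - c_i} \\
\theta_i &= b_i + \frac{1}{a_i}\ln\!\left(\frac{p_R - c_i}{1 - p_R}\right) \\
s_i &= M\theta_i + K
\end{align*}
```

If the model includes D, the factor $1/a_i$ becomes $1/(D a_i)$. An item with a high lower asymptote or low discrimination can map far from its difficulty parameter $b_i$.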
This recipe results in Table 2, using real data from NAEP. Each summed score aligns with a
single underlying proficiency estimate $\hat{\theta}$, scale score $\hat{s}$, achievement level, and nearby mapped
item. This recipe is online and available at https://emmaklugman.github.io/files/open-tests.html,
complete with open-source code in R. Although we recommend scores for
aggregate-level inferences, we also include estimates of standard errors for each
individual-level scale score using Item Response Theory.
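
Once such a table exists, scoring and aggregate reporting reduce to a lookup. The Python sketch below is an illustration rather than the R code from our Online Appendix. It copies five rows from Table 2 and applies them to four hypothetical response vectors on a 30-item test. The function names and response data are assumptions.

```python
from collections import Counter

# Rows copied from Table 2: summed score -> (scale score, achievement level).
LOOKUP = {
    13: (213, "Below Basic"),
    14: (219, "Basic"),
    16: (231, "Basic"),
    20: (251, "Proficient"),
    26: (286, "Advanced"),
}


def score_student(responses):
    """Sum 0/1 item scores, then look up the scale score and level."""
    summed = sum(responses)
    scale, level = LOOKUP[summed]
    return summed, scale, level


def level_shares(all_responses):
    """Proportion of students in each achievement level (aggregate summary)."""
    levels = Counter(score_student(r)[2] for r in all_responses)
    n = len(all_responses)
    return {level: count / n for level, count in levels.items()}


# Hypothetical response vectors for four students on a 30-item test.
students = [
    [1] * 13 + [0] * 17,
    [1] * 14 + [0] * 16,
    [1] * 20 + [0] * 10,
    [1] * 26 + [0] * 4,
]
shares = level_shares(students)
```

Reporting shares by level, rather than individual scores, matches the aggregate-level use that we recommend.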


Table 2. Sum scores, estimated θ scores, scale scores, achievement levels, and item maps
with content areas shown.


Sum    Theta   Scale   Achievement   Subscale      Item
Score          Score   Level

8      -2.48   162     Below Basic   Geometry      Identify a figure that is not...
9      -2.01   177     Below Basic   Geometry      Divide a square into various...
10     -1.65   188     Below Basic   Measurement   Identify appropriate...
11     -1.36   198     Below Basic   Measurement   Identify a reasonable amount...
12     -1.10   206     Below Basic   Operations    Identify the place value of a...
13     -0.88   213     Below Basic   Operations    Recognize the result of...
14     -0.68   219     Basic         Operations    Compose numbers using place...
15     -0.49   225     Basic         Operations    Represent the same whole...
16     -0.32   231     Basic         Operations    Subtract three-digit number from...
17     -0.15   236     Basic         Algebra       Solve a one-variable linear...
18     0.01    241     Basic         Algebra       Determine the missing shapes in...
19     0.17    246     Basic         Algebra       Mark locations on a grid...
20     0.33    251     Proficient    Geometry      Use an interactive tool to create...
21     0.49    256     Proficient    Measurement   Determine perimeter of a...
22     0.65    262     Proficient    Algebra       Determine and apply a rule...
23     0.82    267     Proficient    Operations    Represent fractions using a...
24     1.00    273     Proficient    Measurement   Identify given measurements on...
25     1.19    279     Proficient    Analysis      Determine number of ways...
26     1.40    286     Advanced      Algebra       Determine and apply a rule...
27     1.64    293     Advanced      Operations    Solve a story problem involving...
28     1.93    303     Advanced      Algebra       Relate input to output from a...
29     2.32    315     Advanced      Operations    Compose numbers using place...
30     2.95    335     Advanced      Geometry      Divide a square into various...

Table 2. Sum scores, estimated θ scores, scale scores, achievement levels, and item maps with
content areas shown. Ingredients are from the National Assessment of Educational Progress
and the National Center for Education Statistics. The recipe is available at
https://emmaklugman.github.io/files/open-tests.html.


Discussion: Cautions and Caveats


We close with a series of caveats. One set of caveats relates to the interpretation and use of 
individual scores. A second set of caveats builds upon the first, with additional threats to the 
comparability of aggregate scores to past years. Users of these tests in a crisis may try to 
answer two important descriptive questions: 1) How much have scores declined? 2) How much 
have score disparities grown? Answers to these questions must attend to these sets of caveats. 


First, in a crisis, many physical and psychological factors may threaten a typical administration
and introduce construct-irrelevant variance. We cannot emphasize enough the appropriately
tertiary and supplemental role of the tests that we propose here. Physical health and safety
must come first in a crisis, followed by assessments of social and emotional well-being.
Students must be safe and feel safe before they can learn or demonstrate what they have
learned.


Second, when many students are working from home, online test-taking under varying
administration conditions is a threat to comparability. Complicating factors in home
administrations include online connectivity, parental involvement, and other in-home 
interference or distractions. Such factors can inflate scores if, for example, parents assist 
students, or students use additional online resources. They can deflate scores if there are 
atypical distractions or poor internet connectivity. 


Third, these tests typically follow standardized instructional on-ramps at the end of a year of 
instruction. Irregular or inconsistent exposure to instruction prior to administration will 
threaten standard interpretations of scale scores. For example, consider a fall administration 
that follows a fall instructional unit where teachers emphasize algebra over other domains like 
geometry or measurement. Resulting scores may lead users to underestimate algebra 
proficiency, when in fact the scores reflect relatively low proficiency in other domains. 


Additional threats to inferences arise at the aggregate level, to the extent that the population in 
school in a crisis may not be the same as in years past. Students who are not in school in a crisis 
are not missing at random. Standard interpretations of trends and gap trends will be 
threatened to the extent that the population of students in school does not match the 
population of students who would have been in school absent the crisis. Matching based on
scores from past years and other covariates may help to address some of this bias, but such a
procedure risks precision and transparency. 


The use of existing classroom and interim assessments will require caveats similar to those above.
The one important exception is the third caveat, where classroom and district assessments may 
have more flexible and appropriate instructional onramps. However, high-quality district 
assessments are not available to all districts, and these are not always directly interpretable in 
terms of state content and performance standards. 


Thus, in spite of these necessary caveats, we emphasize that state testing programs already 
make high-quality ingredients for useful tests available to the public, and we provide a recipe as
well as guardrails for appropriate use. We encourage states to release the currently missing
ingredient, a key for merging items with parameter estimates. The cost would be negligible. All 
low-stakes assessment options should be available to schools and districts in a crisis. 